ABSTRACT

Mounting concern for student achievement in writing 
has refocused attention on the features of writing assessment 
necessary to represent a student's skill fairly, usefully and 
economically. If writing tests are to fulfill their intended 
function, the writing assignments and evaluative criteria of large 
scale tests and instruction must interrelate. Current practices are 
increasingly criticized regarding relevance to realistic writing 
situations, utility for forming decisions about individual competence 
or program effectiveness, fairness, and legality for sanctioning exit 
requirements. State and district writing assessments should 
re-evaluate their methods, considering alternatives proposed by recent
writing theory and research. Specifying writing goals which
distinguish between minimum functional goals and desirable goals of
competence may improve the logic, utility and fairness of test
procedures. Appropriate writing tasks should be designed to provide a 
full rhetorical context, and time to engage in all parts of the 
writing process. An integrated instructional system which targets
particular writing elements as important basic competencies would
involve teachers and evaluators in specification of rating
criteria, whether holistic judgments or several separate analytic
scores. The technical quality of rating criteria is a problem of
scale stability and validity. Cost concerns should not outweigh
concerns for fairness and utility. (CM)

INTRODUCTION

To attain the fundamental goal of language competence, educators,
students, and parents must have information describing the status and
progress of language skills development. Mounting concern for student
achievement in writing, one of the principal arenas of language
development, has refocused the attention of policy makers, evaluators,
instructors, and researchers on the features of writing assessment
necessary to represent a student's writing skill fairly, usefully, and
economically. While the relationship between procedures employed to
evaluate writing in large scale testing and those used in the classroom
has historically been tenuous, the requirements of minimum competency
testing programs have stimulated research on methods to tighten the
connection. These competency testing programs require school systems to
assess the status of students' basic skill achievement, and then either
to certify that minimal competencies have been attained or to signal the
need for remediation and provide repeated opportunities for students to
pass comparable test forms. If these writing competency tests are to
fulfill their intended function, then the writing assignments and
evaluative criteria of large scale tests and classroom instruction must
interrelate.

At present, many large scale writing tests bear little resemblance
to students' classroom writing experiences. Many states and districts
rely on multiple choice tests that measure sentence-level editing skills
or passage comprehension. When writing samples are collected, the
structure and topic of the writing assignment may call for information
and strategies that vary considerably from students' experiences in and
out of the classroom. Furthermore, writing samples are often scored
rapidly and holistically by raters trained to varying levels of
precision and accuracy. Students receive a single score purportedly
representing the level of their writing competence.

Reactions of practitioners and researchers to such current practices
are increasingly critical. They find many faults in current writing
tests: their logical and psychological relevance to realistic writing
situations, their utility for informing decisions about individual
competence or program effectiveness, their fairness to students and
instruction, and their legality for sanctioning exit requirements. This
paper suggests that state and district writing assessments should
re-evaluate their current methods for assessing student writing
competence in light of these criticisms. An accumulating body of
literature indicts many of the methods assessments now use, methods
derived from custom, folklore, and adaptations of norm-referenced
testing methodology that are inappropriate for the purposes of
competency assessment. By examining the criticisms leveled at writing
tests and considering alternatives proposed by recent writing theory and
research, we may find solutions that will improve the fairness and
utility of writing assessments, yet remain within reasonable economic
bounds.

PROBLEM 1: SPECIFYING WRITING GOALS

Just what is "good" writing? For schools, a major conflict has been
to distinguish between realistic characteristics of minimum competence,
reasonable high school writing exit competence, and the competence of
professional writers and "experts." A significant component in this
controversy over "standards" has been the function various types of
writing can and/or should have for the student. Thus the discourse aim
or writing purpose of transactional writing has been identified by many
school systems as functionally most relevant to the majority of
students. At the lower grades, expressive writing has been viewed by
some as valuable in its own right and by others as an educational
vehicle for motivating writing that will increase fluency and
sentence-level competence.

Clearly, the schools' definition of the target constrains the specific
criteria that will provide logical and empirical evidence that the target
has been hit. Currently, goals may relate to two competency levels: a
minimum competency level targeted by most state and district minimum
competency testing programs and a reasonably desirable high school exit
competency level implied in many systems' curricular goals. Most
competency programs emphasize transactional writing in the factual
narrative, expository, or persuasive modes. Minimum program goals are
often that students write a clear, coherent paragraph that makes a point
and that exhibits few or no mechanical, sentence-level errors. For high
school exit goals, English departments set their sights at the
multi-paragraph, essay level, seeking writing that has a theme or point,
that is coherent between, as well as within, paragraphs, and that
exhibits few sentence-level errors. While minimum goals generally
specify functional writing, high school exit goals may expand the types
of writing aims or purposes in which it is desired that students be
competent. By distinguishing between minimum and desirable goals, school
systems may be in a better position to defend the logic, utility, and
fairness of focused test procedures.

PROBLEM 2: DESIGNING APPROPRIATE WRITING TASKS 
Perhaps the most common controversy in the design of writing tests
involves the relative merits of direct and indirect tasks. Indirect,
usually multiple choice, measures have been defended by test publishers
because of their economy and high correlations with essay scores
(Godshalk, Swineford & Coffman, 1966; Breland & Braucher, 1977). Critics
of multiple choice tests reject them on logical and psychological
grounds. They argue that multiple choice tests present primarily editing
tasks or comprehension tasks and that they therefore do not tap the same
kinds of mental processes required by production tasks (Bourn**, 1966;
Quellmalz, 1978; Cooper, 1979). Recent empirical studies of students'
scores on direct and indirect measures indicate considerably lower
correlations between writing skill component scores derived from
multiple choice and writing samples (Quellmalz & Capell, 1979;
Quellmalz, Smith, Winters & Baker, 1980; Moss, Cole & Khampaliket, in
press). Furthermore, Quellmalz and Capell found multiple choice test
scores provided less distinctive information about underlying writing
skill constructs or traits than did essay ratings (Quellmalz & Capell,
1979). In combination, these studies support contentions that direct and
indirect measures tap different psychological processes. These data
would also, of course, suggest that multiple choice test scores would
not serve as fair or useful proxies for actual writing skill. At best,
multiple choice tests seem to over-estimate skills (NAEP, 1981) since
they measure skills presumably en route to production skills (Skinner,
1957).

In addition to debate over the form of response required by writing
tests, there is considerable disagreement about the appropriate structure
of assignments used to prompt writing. Criticisms of writing tasks are
that they do not present full rhetorical contexts that sufficiently
inform students about the writing purpose, topic, audience, writer's
role, and intended criteria (Britton, 1978; Cazden, 1974; Scribner &
Cole, 1978; Florio, 1979). Research shows that writers' performance
differs when writing in different discourse modes, e.g., exposition and
narration (Veal & Tillman, 1971; Crowhurst, 1980; Quellmalz & Capell,
1979; Praeter & Padia, 1980; Baker & Quellmalz, 1980). Research also
reveals that accessibility of information about an assigned topic
affects the quality of students' writing (Baker & Quellmalz, 1980).
Polin (1980) has found that when writers are given extended time and
cues about the rhetorical demands of the task during planning or
revision, some of them improve in various features of their work. In
sum, studies of features of the writing task that influence students'
writing performance suggest that variations within features such as
mode of discourse (writing aim), topic, audience, time, and structural
cues do present different psychological demands and therefore should be
distinctly specified. To be clear and fair, the writing task should
provide a full rhetorical context and time to engage in all parts of the
writing process. The cost of developing well formed writing prompts is
not high, particularly in comparison to the cost of erroneous inferences
about competence made from assessments of writing that students generate
in response to incomplete or ambiguous prompts.

PROBLEM 3: SPECIFYING SCORING CRITERIA AND TYPE OF RATING SCALE

Criteria employed for evaluating student writing vary along a number
of dimensions: from qualitative to quantitative; from general to specific;
from comprehensive, full discourse features to isolated features; from
vague guidelines to replicable, objective guidelines.

At the most qualitative, vague end of the continua are general
impression scoring schemes where readers apply their own criteria to
give the writing a single global score. Follman and Anderson's
"Everyman" procedures (1967) and teachers' A-F grading schemes fall in
this category. Still providing a single score or quality rating, but
guided by slightly more descriptive and acknowledged criteria, are
holistic rating schemes such as the ETS four or six-point scales which
rank papers within a set. Teachers' use of a letter grade with some
supporting comments might relate to this evaluation scheme. Some rating
schemes are specific to discourse mode; others, like the primary trait
rating method, are specific to discourse mode and the particular topic
(Lloyd-Jones, 1977). The most detailed scales are analytic rating
schemes referencing component features of the written product.

Where do these criteria come from? Criteria for these scales may be
inferred from features commonly referenced by knowledgeable readers;
they may be arbitrary; or they may be theoretically or empirically based
dimensions deemed important by the group designing the scheme. Analytic
scales vary in the degree to which they comprehensively reference
rhetorical, structural, and syntactic features, as well as the degree to
which criteria for features are qualitative or quantitative. In an
attempt to be comprehensive, the subscales of the Diederich Expository
Scale range from "ideas" to spelling (Diederich, 1974). In contrast,
analytic text analysis schemes such as T-unit analyses or Halliday and
Hasan's measures of cohesion focus on isolated components of the written
piece (Halliday & Hasan, 1976). Diederich's "flavor" subscale is far
more qualitative and judgmental than counts of numbers and types of
cohesive ties. In classroom evaluations of student writing, grades and
teachers' comments, too, may reference a range of essay features such as
content, organization, and mechanics (Freedman, 1979); or comments may
only relate to sentence-level problems.

One issue in developing or using a rating scheme is the meaning of
writing score(s). From a psychological perspective, does being a "2"
vs. a "4" discriminate between levels of a student's writing competence?
At present, there is little research evidence that any sets of criteria
in actual use are more valid than others for discriminating between
levels of expertise. From a logical perspective, how specific,
replicable, and informative are rating criteria? Pedagogically, what
implications do the scores have for diagnosing strengths and weaknesses?
The bases of the score, the criteria, should serve as feedback to
teachers, students, and parents. To be fair, criteria employed in
minimum competency tests should specify writing elements that are basic
writing skills, e.g., organization, support, mechanics. The criteria
should also be those amenable to instructional intervention. The more
judgmental, qualitative, sophisticated, and less teachable writing
elements such as flavor, style, or voice would seem less fair and less
useful, and would therefore be inappropriate as rating criteria for
judging basic writing competence. Specification of criteria may be the
most important decision affecting the utility of information provided by
assessment, both large scale and classroom level. Certainly, consensual
decisions on these criteria should involve instructional and evaluation
personnel.

It seems logical that criteria used in large scale writing competency
assessment should reflect, if not derive from, criteria used to evaluate
student classroom writing. An ideally integrated instructional system,
one which targets particular writing elements as important basic
competencies, would involve teachers and evaluators in specification of
rating criteria and encourage focused classroom guidance, feedback, and
evaluation on these elements. Instructionally, specification of valued
basic criteria could provide a more comprehensive framework for teachers
to focus instruction and communicate feedback to students about their
writing. The scanty research on classroom evaluation methods suggests
that teacher comments more often cite easily identified sentence-level
mechanical errors than text level feedback such as organization and
support (Pitts, 1978; Quellmalz, Baker, & Enright, 1980). As Coffman
pointed out, while few would recommend complete restriction and
regulation of the criteria teachers use in classroom writing assessment,
neither would they condone subjecting students and the instructional
program to wildly fluctuating, idiosyncratic standards of individual
teachers (Coffman, 1971). Some standardization of writing criteria seems
particularly critical for minimum competency goals. And, of course,
schools using the same criteria for system-wide and classroom assessment
would eventually reduce the cost of training raters.



Assuming that logical, fair, and useful assessment criteria have
been specified, the format for recording scores remains a problem. Many
large scale assessments report a single, holistic score. A logical
question is whether it makes sense to comment on component features of a
student's writing instead of, or in addition to, its overall quality. A
likely question to be raised about a single global score by a teacher,
student, parent (or lawyer) is "Why?" followed by "Show me." While
writing theory may suggest that the "whole" is greater than the sum of
its parts, research in psychology and pedagogy suggests that learners
advance when taught how to use components and combine them into
competent performance (e.g., Skinner, 1957; Resnick, 1980). Another
logical question is whether students are differently classified as
masters and non-masters and/or if analytic schemes yield a differential
score profile. Winters (1973) found that various scoring rubrics,
including a general impression scale, two analytic scales, and a T-unit
analysis, did classify students differently. Quellmalz, Smith, Winters &
Baker (1980) found that three separate holistic rubrics and an analytic
rubric classified entering freshmen differently. Similarly, Polin (1980)
found very low correlations between primary trait and analytic ratings
of the same essays. Each of these studies compared scoring rubrics which
referenced some similar criteria but which, in application, produced
variable characterizations of the same essays. Still unexamined are the
cost benefits of scales using the same criteria, but recording a single,
holistic judgment vs. several separate analytic scores. Such a study is
currently in progress (Quellmalz, 1981).

A major problem for large scale writing assessments, to be sure, is
the cost of providing detailed ratings. In the narrowest sense, cost is
measured in terms of time required to train raters and time required to
rate papers. Generally, training on more criteria that are very explicit
requires more time than training on fewer or less explicit criteria.

Currently available data on scoring costs indicate that training time
for holistic and primary trait scoring averages two to four hours (Powliss,
Bowers, & Conlan, 1979; Mullis, 1980), and for analytic scoring averages
six to eight hours (Smith, 1978; Quellmalz & Capell, 1979). Trained raters
can reliably assign a holistic or primary trait score to a student's paper
in 30 seconds to 1½ minutes (Powliss et al., 1979; Mullis, 1980). Rating
time for providing five to eight separate analytic scores ranges from four
to five minutes for multi-paragraph essays and from two to four minutes
for paragraphs (Smith, 1978; Quellmalz & Capell, 1979).
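
The narrow cost comparison above can be made concrete with a rough
calculation. The sketch below is only an illustration: the number of
papers, the number of raters, and the assumption that each paper is read
twice are hypothetical, and the training and rating times are simply
midpoints of the ranges reported above.

```python
def scoring_hours(n_papers, n_raters, training_hours, minutes_per_paper,
                  readings_per_paper=2):
    """Total rater-hours: training for every rater plus time for every reading.

    readings_per_paper=2 is an assumption (two independent readings per
    paper); the source does not state how many readings are used.
    """
    training = n_raters * training_hours
    rating = n_papers * readings_per_paper * minutes_per_paper / 60
    return training + rating


# Hypothetical district: 1,000 multi-paragraph essays, 10 raters.
# Holistic: 3 h training (midpoint of 2-4 h), 1 min per paper (midpoint of 0.5-1.5).
holistic = scoring_hours(1000, 10, training_hours=3.0, minutes_per_paper=1.0)
# Analytic: 7 h training (midpoint of 6-8 h), 4.5 min per paper (midpoint of 4-5).
analytic = scoring_hours(1000, 10, training_hours=7.0, minutes_per_paper=4.5)

print(f"holistic: {holistic:.1f} rater-hours")
print(f"analytic: {analytic:.1f} rater-hours")
```

Under these assumptions the analytic format costs several times as many
rater-hours as the holistic one, which is the difference a system must
weigh against the diagnostic value of the separate scores.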

In a recent study comparing two score formats (an analytic scheme and
a holistic scheme modified to provide diagnostic checks for students
rated below mastery), Quellmalz found that average rating times per
paper differed by approximately one minute (Quellmalz, 1981). Is the
additional training and rating time "worth it?" School systems weighing
this question might consider broader definitions and implications of
cost. First, the cost of either analytic or holistic training could be
jointly shared as an inservice activity by curriculum budgets. These
training costs would then decrease once all teachers in a system were
trained, since the system would require only periodic review of the
procedures. A second potential cost sharing strategy is to view essay
ratings as diagnostic components of the instructional system, used both
to focus and to monitor program improvement. A third cost concern is an
ethical one. Students spend considerable time producing writing samples,
and the psychological and opportunity costs of making uninformed or
erroneous decisions of student failure can be profound. Finally, a
system might consider the degree of specific support useful for
defending mastery/non-mastery classifications; the costs of remediation
and lawsuits because of misclassifications can be high.

PROBLEM 4: TECHNICAL QUALITY OF RATING CRITERIA
A fundamental responsibility of an assessment program is the
documentation of its technical quality. For writing assessments this
becomes a problem of scale stability and validity, i.e., demonstrating
that score criteria are applied uniformly within and between rating
occasions and that other measures of student writing competence
corroborate the test ratings (Quellmalz, 1980).

When carefully structured scale training sessions precede actual
rating, most holistic and analytic rating scales can demonstrate high
inter-rater reliability (Powliss et al., 1979; Mullis, 1980; Quellmalz,
1980; Steele, 1979; Van Nostrand, 1980). But inter-rater agreement
within a rating session is not sufficient for demonstrating scale
reliability. Analogous to the problem of test-retest reliability, a
reliable scale must be stable, i.e., demonstrate that its criteria would
be applied consistently by new sets of raters both to a new set of
papers and to the set of papers scored by the first raters. To the
extent that criteria are differently applied, the scale is not stable
and reliable (Quellmalz, 1980).

Few scales currently used in writing assessment report data about
their stability across sets of raters and rating occasions. It seems
that scales with more explicit and operational criteria are less
susceptible to fluctuating judgments and are more likely to be stable
across paper sets and raters. Holistic scales such as the ETS method,
which awards scores according to a paper's ranking within a unique set
of papers, result in a sliding scale (Conlan, 1979). A "2" paper in
one paper set may well have characteristics quite different from a "2"
paper in a set of papers with a broader or narrower quality range. While
some attempt is made to stabilize judgments across sets of raters by
inserting anchor papers during training, anchor papers are less
frequently interspersed in actual rating sequences. Statistical evidence
of the comparability of scores given on any such anchor paper by
different groups of raters is noticeably, and seriously, absent. Thus,
holistic scales using ranking procedures within sets and unexplicated
criteria are suitable for norm-referenced selection decisions, but
cannot meet competency test requirements for stable, uniform application
of criteria. On the other hand, holistic scales based on more
descriptive criteria, such as the primary trait method (Lloyd-Jones,
1977), may be more likely to permit stable application across paper and
rater sets. Reports for most analytic scales also document inter-rater
reliability within rating occasion but do not track stability across
occasions. For analytic as well as holistic scales, precision of
criteria is a critical factor in achieving scale stability. School
systems designing writing assessments should routinely report
inter-rater reliability and check scale stability on common paper sets
scored at different rating sessions. These measures will reassure
stakeholders that assessments are uniform and fair.
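
The distinction between inter-rater agreement within a session and
scale stability across sessions can be illustrated with a small
calculation. The sketch below is not a procedure taken from the studies
cited; it uses the simple proportion of exact score matches as a stand-in
index, and the scores, rater labels, and four-point scale are all
hypothetical.

```python
def exact_agreement(scores_a, scores_b):
    """Proportion of papers given the identical score in both lists."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


# Hypothetical scores on a 4-point scale for the same eight papers.
session1_rater1 = [1, 2, 2, 3, 4, 3, 2, 1]
session1_rater2 = [1, 2, 3, 3, 4, 3, 2, 1]  # trained in the same session
session2_rater3 = [2, 3, 3, 4, 4, 3, 3, 2]  # trained at a later session

# Agreement between two raters within one rating session.
inter_rater = exact_agreement(session1_rater1, session1_rater2)
# Agreement on the common paper set across rating sessions.
stability = exact_agreement(session1_rater1, session2_rater3)

print(f"within-session agreement: {inter_rater:.3f}")
print(f"cross-session agreement:  {stability:.3f}")
```

In this invented example the two raters trained together agree on nearly
every paper, while the later group scores the same papers consistently
higher. A report limited to within-session agreement would not reveal
that drift.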

The task of documenting the validity of writing assessment rating
scales can take several forms. Most competency-based writing assessments
attempt to establish content validity through expert judgments about the
skills assessed (Breland & Ragosa, 1976). Few writing assessment programs
go on to subject the rating scales used to evaluate those skills to
content validity scrutiny. Since, for written production, the scale
defines what acceptable writing is, the content validity of scales should
be judged by the same procedures as test items or specifications. It may
be that some scales with vague criteria or criteria heavily weighted
toward sentence level mechanics would not get the stamp of approval from
a broad range of experts. It should be noted that holistic scales with no
explicit criteria are "content" free and assignment specific. These
scales are not suitable for competency assessments.

Of course, content validity is only one index of validity (Cronbach,
1971; Messick, 1975). Concurrent or predictive and construct validity
should also be examined. The most common method for validating large scale
rating schemes has been to report their correlations with other
writing-related measures including other English grades, reading test
scores, and multiple choice writing test scores. Many of these
"criterion" variables, however, are even more questionable indicators of
writing ability than the rating scale being validated. A major problem in
validating rating scales is identifying appropriate criterion groups and
test scores (Winters, 1978; Quellmalz, Spooner-Smith, Winters, & Baker,
1980). A directly related criterion would be relationships of immediately
preceding and subsequent writing assignment scores. Unfortunately, as
different criteria are often employed in other rating scales and/or in
teachers' grading of assignments, few appropriate direct comparisons are
possible.

From the student's viewpoint, this problem raises concerns for
fairness and instructional validity. How closely do the criteria used in
the assessment match those used in the classroom, and how closely do they
represent writing skills for which the student has received instruction?
Fundamental precepts of fairness require that if a system hasn't explicitly
taught the skills, it shouldn't hold the student accountable for being
competent in these skills. For example, originality, humor, and flavor
are desirable features of writing, but they are not often directly
taught. If we have no information on the criteria used in holistic
scoring, that method isn't fair; we have no way to determine if what was
tested was what was taught. The legal implications of this dilemma are
obvious.

SUMMARY

Balancing ideally detailed analyses of students' writing with the
costs of those analyses is no easy task. School systems and teachers across
the country are wrestling with the problem and arriving at varying solutions.
Some systems don't even try to initiate large scale rating of writing samples.
Some teachers assign little writing and provide cursory or global feedback.
Other systems are willing to pay the price and mount articulated writing
assessment and instructional systems (e.g., Detroit, Los Angeles, Pittsburgh).
Some rating schemes apply explicit, replicable, reasonable criteria;
some scales are silly, some are misapplied, some are downright harmful.

Large scale assessments can devise ways to reduce the costs of
training raters to score large numbers of essays. In an ideally
integrated assessment system, tasks and criteria for the large scale
assessment would be the same as those used in the classroom. A district
or state might construct a scale that referenced basic text components
used by classroom teachers, e.g., main idea, coherence, support,
mechanics, and devise a scoring system which checks off papers as
competent on each skill and also checks off in more detail the
components falling below mastery. For example, one paper might have
competent support and receive a mastery check; another essay might not
and get a check because "details are not related to the main point," or
"details are not concrete."
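
A check-off scoring record of the kind just described could be
represented roughly as follows. This is a hedged sketch, not a format
used by any district named here: the class and function names are
hypothetical, and the rule that a paper is rated competent overall only
when every skill is checked as mastered is an assumption made for the
example.

```python
from dataclasses import dataclass, field

# Basic text components named in the text above.
SKILLS = ("main idea", "coherence", "support", "mechanics")


@dataclass
class SkillRating:
    mastered: bool
    # Diagnostic checks, recorded only when the skill falls below mastery.
    reasons: list = field(default_factory=list)


def summarize(paper):
    """Return (overall_mastery, diagnostics) for one rated paper.

    `paper` maps each skill name to a SkillRating. Overall mastery here
    means every skill is mastered, which is an assumed decision rule.
    """
    missing = [skill for skill in SKILLS if skill not in paper]
    if missing:
        raise ValueError(f"unrated skills: {missing}")
    diagnostics = {skill: paper[skill].reasons
                   for skill in SKILLS if not paper[skill].mastered}
    return (not diagnostics, diagnostics)


# One paper competent on every skill, one falling below mastery on support.
paper_a = {skill: SkillRating(True) for skill in SKILLS}
paper_b = dict(paper_a)
paper_b["support"] = SkillRating(
    False, ["details are not related to the main point"])

print(summarize(paper_a))
print(summarize(paper_b))
```

A record like this carries the single mastery decision needed for
reporting and, for papers below mastery, the specific reasons that
teachers and students can act on.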

Systems might allocate the cost of training raters to staff
development. All teachers could be trained in applying the rating
criteria, which should promote greater articulation of the formal
assessment with classroom criteria. Districts such as Detroit find it
cost effective to pay lay personnel to rate writing samples.
Alternatively, the system might ask teachers to swap papers. Teachers
could use the rating scale to score writing of other students in the
district in return for having their students' writing scored by other
teachers trained as raters. This would reduce training costs for district
scoring. Many alternative logistics could be engineered to spread the
time and energy costs efficiently within existing system resources.



Critics of writing assessment are questioning the fairness and
utility of these assessments. Too many school systems cite cost as the
reason that they cannot provide more valid, useful assessment. We think
the technology and ingenuity exist to devise more defensible writing
assessments now. We should no longer permit concern for cost to outweigh
concern for fairness and utility.